Transflower: probabilistic autoregressive dance generation with multimodal attention

Guillermo Valle-Pérez, Gustav Eje Henter, Jonas Beskow, André Holzapfel, Pierre-Yves Oudeyer, Simon Alexanderson

Dance requires skillful composition of complex movements that follow the rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as the problem of modelling a high-dimensional continuous motion signal conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that uses a multimodal transformer encoder to condition a normalizing flow on previous poses and music context, and thereby models the distribution over future poses. Second, we introduce what is currently the largest 3D dance-motion dataset, obtained with a variety of motion-capture technologies and including both professional and casual dancers. Using this dataset, we compare our new model against two baselines, via objective metrics and a user study, and show that both the ability to model a probability distribution and the ability to attend over a large motion and music context are necessary to produce interesting, diverse, and realistic dance that matches the music.
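The idea of sampling each future pose from a context-conditioned normalizing flow can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's architecture: poses and music features are reduced to single floats, the multimodal transformer encoder is replaced by a mean-pooling stand-in (`encode_context`), the flow is a single affine layer with hypothetical fixed weights (`flow_params`), and all function names are invented here.

```python
import math
import random


def encode_context(past_poses, music_feats):
    """Stand-in for a multimodal encoder: mean-pools each stream.

    Assumption: the abstract describes a transformer encoder; pooling is
    used here only to keep the sketch self-contained.
    """
    pose_mean = sum(past_poses) / len(past_poses)
    music_mean = sum(music_feats) / len(music_feats)
    return pose_mean, music_mean


def flow_params(context):
    """Map the context to a shift and log-scale (hypothetical fixed weights)."""
    pose_mean, music_mean = context
    shift = 0.9 * pose_mean + 0.5 * music_mean
    log_scale = -1.0 + 0.1 * math.tanh(music_mean)
    return shift, log_scale


def sample_next_pose(context, rng):
    """Sample by pushing Gaussian noise through the conditional affine flow."""
    shift, log_scale = flow_params(context)
    z = rng.gauss(0.0, 1.0)
    return shift + math.exp(log_scale) * z


def log_prob(pose, context):
    """Exact log-density of a pose via the change-of-variables formula."""
    shift, log_scale = flow_params(context)
    z = (pose - shift) / math.exp(log_scale)
    log_base = -0.5 * (z * z + math.log(2.0 * math.pi))
    # Subtract log|dx/dz| = log_scale for the affine map x = shift + exp(log_scale) * z.
    return log_base - log_scale


def generate(seed_poses, music, window, rng):
    """Autoregressive rollout: each new pose is fed back as context."""
    poses = list(seed_poses)
    for t in range(len(seed_poses), len(music)):
        ctx = encode_context(poses[-window:], music[max(0, t - window):t + 1])
        poses.append(sample_next_pose(ctx, rng))
    return poses


# Toy "music" feature track and a two-frame seed pose sequence.
music = [math.sin(0.3 * t) for t in range(40)]
dance = generate([0.0, 0.0], music, window=8, rng=random.Random(0))
```

Because each step draws fresh noise, rerunning `generate` with a different random seed yields a different motion sequence for the same music, which is the property the probabilistic formulation is meant to provide; a deterministic regressor would instead return one fixed output per input.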

Subjects: Sound (cs.SD); Graphics (cs.GR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Cite as: arXiv:2106.13871 [cs.SD]
